JLC 445/696 Final Project Template

04 December 2024

Your Name

The Final Project is an extension of Project 3 that also involves integrating work you’ve done throughout this semester and beyond. Use the format provided in this document as much or as little as you’d like, as long as the required components are covered.

If you want to get crazy with Markdown themes, here’s a great link.

1. Research Question

The goal of this project is to forecast future crime activity in Washington, D.C. for calendar year 2025.

These instructions will demonstrate how to forecast using an ARIMA (Autoregressive Integrated Moving Average) model, a univariate model that predicts future values of a single variable over time. Let’s break down these terms:

  - Autoregressive (AR): the variable is regressed on its own past values.
  - Integrated (I): the data are differenced to remove trends and make the series stationary.
  - Moving Average (MA): the model incorporates the errors of past forecasts.
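As a quick, self-contained illustration of these pieces (a sketch using simulated data, not the crime data), you can generate a series from a known ARIMA process and watch auto.arima() recover orders close to the truth:

```r
library(forecast)

set.seed(42)
# simulate 1,000 observations from a known ARIMA(1,1,1) process
sim <- arima.sim(model = list(order = c(1, 1, 1), ar = 0.5, ma = -0.4),
                 n = 1000)

# auto.arima() should select orders close to (1, 1, 1)
auto.arima(sim)
```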

Your research question should consider behavior, space, and time. You’ll either measure or control for each of these.

For example: using 2008-2024 data, forecast robbery counts and locations in Washington, D.C. for each month in 2025.

2. Data

Acquire

  1. Load the libraries you need for success:
library(tidyverse)
library(leaflet)
library(sf)
library(tigris)
library(zoo)
library(aTSA)
library(tseries)
library(forecast)
  2. Get the latest and greatest D.C. crime data, like we did in Projects 2 and 3:
url <- "https://opendata.dc.gov/datasets/DCGIS::crime-incidents-in-"
years <- c(2008:2024)
full.urls <- paste0(url, years, ".csv")
dc.data <- data.frame()

for(file in full.urls) {
  tmp.data <- read.csv(file, stringsAsFactors = FALSE)
  dc.data <- rbind(tmp.data, dc.data)
}
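Network downloads occasionally fail partway through a loop. A slightly more defensive sketch of the same loop (using the full.urls vector from above) skips and reports any file that errors, instead of stopping:

```r
dc.data <- data.frame()

for (file in full.urls) {
  tmp.data <- tryCatch(
    read.csv(file, stringsAsFactors = FALSE),
    error = function(e) {
      message("Skipping ", file, ": ", conditionMessage(e))
      NULL  # return NULL so the loop can continue
    }
  )
  if (!is.null(tmp.data)) dc.data <- rbind(tmp.data, dc.data)
}
```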

Wrangle

  3. Separate the date, format it, and enrich the dataset with more temporal measures:
dc.data <- separate(dc.data, REPORT_DAT, into = c("DATE", "TIME"), sep = " ")
dc.data$DATE <- as.Date(dc.data$DATE, format = "%Y/%m/%d")
dc.data$YEAR <- substr(dc.data$DATE, 1, 4)
dc.data$MONTH <- month(dc.data$DATE)
dc.data$MONTHS <- months(dc.data$DATE)
dc.data$DAY <- day(dc.data$DATE)
dc.data$DOW <- weekdays(dc.data$DATE)
dc.data$HOUR <- substr(dc.data$TIME, 1, 2)
  4. Determine the total count and percentage of each crime type:
dc.sum <- dc.data %>%
  group_by(OFFENSE) %>%
  summarise(CRIME.COUNT = n()) %>%
  mutate(PCT = round(CRIME.COUNT/sum(CRIME.COUNT)*100,2))
  5. Use the table from #4 to find and filter the crime type(s) to forecast (choose your own!):
group <- c("ROBBERY")
dc.subset <- dc.data %>% filter(OFFENSE %in% group)

3. Methods

Building a relevant methodology is largely dependent on your research question and data. Any good methodology involves multiple steps for acquiring (done above), wrangling (done above), calculating, visualizing, and analyzing data. There is no single right answer here; rather, it’s the ability to craft a unique, distinct workflow that results in innovative work.

To use an ARIMA model, we will identify values for p, d, and q:

  - p: the number of autoregressive (AR) terms, informed by the PACF
  - d: the number of times the data are differenced to become stationary
  - q: the number of moving average (MA) terms, informed by the ACF

Calculate

  1. Calculate crimes per day:
crime <- dc.subset %>%
  group_by(DATE) %>%
  summarise(CRIME.COUNT = n())
  2. Fill in blank days:
crime <- crime %>%
  complete(DATE = seq.Date(min(DATE), max(DATE), by = "day")) %>%
  mutate(WEEKDAY = lubridate::wday(DATE, label = T, week_start = 1), 
         MONTH = lubridate::month(DATE, label = T, abbr = F),
         WEEK = isoweek(DATE),
         DAY = day(DATE),
         YEAR = year(DATE))
# replace the NAs with 0s
crime <- replace(crime, is.na(crime), 0)
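A quick sanity check (a sketch assuming the crime table built above) confirms the sequence is now gap-free:

```r
# consecutive dates should be exactly one day apart, with no NA counts left
stopifnot(all(diff(crime$DATE) == 1))
stopifnot(!anyNA(crime$CRIME.COUNT))
```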
  3. Change the data type to a proper time series, update the sequencing:
cleaned.crime <- zoo(crime$CRIME.COUNT, 
                       seq(from = as.Date(min(crime$DATE)), 
                           to = as.Date(max(crime$DATE)-1), by = 1))
  4. Generate basic summary stats and a basic graph:
summary(cleaned.crime)
##      Index            cleaned.crime   
##  Min.   :2008-01-01   Min.   : 0.000  
##  1st Qu.:2012-03-25   1st Qu.: 5.000  
##  Median :2016-06-17   Median : 8.000  
##  Mean   :2016-06-17   Mean   : 8.301  
##  3rd Qu.:2020-09-09   3rd Qu.:11.000  
##  Max.   :2024-12-02   Max.   :35.000
plot(cleaned.crime)
title("My Crime Per Day") # this adds a title to your graph

  5. Make the data stationary by differencing (expressing each value as a deviation from the previous value), and make a new basic graph. You can take more/higher-order differences as necessary (but probably don’t need to). Examine the graphs and compare the spikes. These are just examples:
stationary1 <- diff(cleaned.crime, differences = 1)
plot(stationary1)

stationary2 <- diff(cleaned.crime, differences = 2)
plot(stationary2)

stationary3 <- diff(cleaned.crime, differences = 3)
plot(stationary3)

Look for the series with the narrowest y-axis range, staying as consistently close to 0 as possible. In this example, stationary1 is the best dataset.
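The same comparison can be made numerically (a sketch assuming the stationary1 through stationary3 series from the step above); a smaller spread suggests a tighter, more stable series:

```r
# spread (max - min) of each differenced series
sapply(list(d1 = stationary1, d2 = stationary2, d3 = stationary3),
       function(s) diff(range(s)))
```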

  6. Conduct an Augmented Dickey-Fuller test to determine whether the data are stationary. The resulting test statistic should be negative; the more negative it is, the stronger the evidence that the data are stationary. In this example, ‘cleaned.crime’ is run as a comparison to data that is not differenced.
adf.test(as.matrix(cleaned.crime))
## Warning in adf.test(as.matrix(cleaned.crime)): p-value smaller than printed
## p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  as.matrix(cleaned.crime)
## Dickey-Fuller = -8.2554, Lag order = 18, p-value = 0.01
## alternative hypothesis: stationary
adf.test(as.matrix(stationary1))
## Warning in adf.test(as.matrix(stationary1)): p-value smaller than printed
## p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  as.matrix(stationary1)
## Dickey-Fuller = -28.316, Lag order = 18, p-value = 0.01
## alternative hypothesis: stationary
adf.test(as.matrix(stationary2))
## Warning in adf.test(as.matrix(stationary2)): p-value smaller than printed
## p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  as.matrix(stationary2)
## Dickey-Fuller = -37.643, Lag order = 18, p-value = 0.01
## alternative hypothesis: stationary
adf.test(as.matrix(stationary3))
## Warning in adf.test(as.matrix(stationary3)): p-value smaller than printed
## p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  as.matrix(stationary3)
## Dickey-Fuller = -47.749, Lag order = 18, p-value = 0.01
## alternative hypothesis: stationary

In this example, all scenarios are negative. When combined with the plot analysis in Step 5, stationary1 is best. Ideally you want to manipulate the data as little as possible: while ‘stationary2’ and ‘stationary3’ produce even more negative statistics, the improvement is not enough to warrant the extra differencing.
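If you would rather compare the test statistics side by side, you can collect them into a small named vector (a sketch assuming the objects from Steps 5 and 6; tseries::adf.test is spelled out because the aTSA package also exports an adf.test):

```r
candidates <- list(cleaned.crime = cleaned.crime,
                   stationary1   = stationary1,
                   stationary2   = stationary2,
                   stationary3   = stationary3)

# more negative = stronger evidence of stationarity
sapply(candidates, function(s)
  suppressWarnings(tseries::adf.test(as.matrix(s))$statistic))
```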

  7. Lagged autocorrelations - specifically using the Autocorrelation (ACF) and Partial Autocorrelation (PACF) functions - will help determine the p and q values. For both calculations and visuals, you will use the best stationary dataset you chose in Steps 5 and 6 above.
pacf(stationary1)

pacf(stationary1, pl=FALSE)
## 
## Partial autocorrelations of series 'stationary1', by lag
## 
##      1      2      3      4      5      6      7      8      9     10     11 
## -0.455 -0.318 -0.225 -0.163 -0.182 -0.180 -0.113 -0.068 -0.081 -0.060 -0.049 
##     12     13     14     15     16     17     18     19     20     21     22 
## -0.082 -0.099 -0.036 -0.018 -0.042 -0.044 -0.024 -0.039 -0.075 -0.038 -0.035 
##     23     24     25     26     27     28     29     30     31     32     33 
## -0.021 -0.005 -0.023 -0.040 -0.051 -0.021  0.001 -0.019 -0.015  0.001 -0.009 
##     34     35     36     37 
## -0.044  0.003  0.004 -0.016
acf(stationary1)

acf(stationary1, pl=FALSE)
## 
## Autocorrelations of series 'stationary1', by lag
## 
##      0      1      2      3      4      5      6      7      8      9     10 
##  1.000 -0.455 -0.045  0.011  0.005 -0.040  0.003  0.042  0.002 -0.030  0.016 
##     11     12     13     14     15     16     17     18     19     20     21 
## -0.001 -0.030  0.002  0.050 -0.013 -0.027  0.010  0.013 -0.024 -0.015  0.044 
##     22     23     24     25     26     27     28     29     30     31     32 
## -0.012 -0.001  0.007 -0.017 -0.011  0.002  0.029  0.000 -0.024  0.012  0.007 
##     33     34     35     36     37 
## -0.020 -0.017  0.051 -0.017 -0.019

In this example, the p should be 6, and the q should be 1.
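One rough way to read these tables programmatically (a sketch assuming stationary1 from above): lags whose absolute correlation exceeds about 1.96/sqrt(n) are conventionally treated as significant.

```r
n <- length(stationary1)
threshold <- 1.96 / sqrt(n)  # approximate 95% significance bound

# count of significant PACF lags: a rough upper bound for p
pacf.vals <- pacf(stationary1, plot = FALSE)$acf
sum(abs(pacf.vals) > threshold)

# count of significant ACF lags (excluding lag 0): a rough upper bound for q
acf.vals <- acf(stationary1, plot = FALSE)$acf[-1]
sum(abs(acf.vals) > threshold)
```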

  8. Build an ARIMA model using the inputs derived from previous steps (based on analysis of the graph in Step 4, it’s reasonable to assume our data has seasonal trends - so we’ll set seasonal to TRUE):
arima.crime <- auto.arima(cleaned.crime, d = 1, max.p = 6, max.q = 1, seasonal = T)
summary(arima.crime)
## Series: cleaned.crime 
## ARIMA(1,1,1) 
## 
## Coefficients:
##          ar1      ma1
##       0.0917  -0.9387
## s.e.  0.0138   0.0052
## 
## sigma^2 = 11.56:  log likelihood = -16331.87
## AIC=32669.75   AICc=32669.75   BIC=32689.94
## 
## Training set error measures:
##                        ME     RMSE      MAE  MPE MAPE      MASE          ACF1
## Training set -0.008816429 3.399132 2.653829 -Inf  Inf 0.7677494 -0.0004144812

Note that the ‘auto.arima’ function runs a series of ARIMA models and chooses the best fit for the data. The best-fitting model can be interpreted as ARIMA(p, d, q).
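To double-check the automated selection, you can fit a specific order yourself with forecast::Arima() and compare information criteria (a sketch assuming cleaned.crime and arima.crime from above; lower AIC indicates a better fit):

```r
manual.fit <- Arima(cleaned.crime, order = c(1, 1, 1))
AIC(manual.fit)   # the manually specified model
AIC(arima.crime)  # the model chosen by auto.arima()
```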

  9. Do a quick residuals check on the model to see how it performed:
checkresiduals(arima.crime)

## 
##  Ljung-Box test
## 
## data:  Residuals from ARIMA(1,1,1)
## Q* = 31.849, df = 8, p-value = 9.914e-05
## 
## Model df: 2.   Total lags used: 10

Proper residuals derived from your model include:

  - residuals centered on zero, with roughly constant variance over time
  - no significant autocorrelation remaining in the residual ACF
  - an approximately normal distribution of residuals
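These properties can be spot-checked directly (a sketch assuming the arima.crime model from the step above):

```r
res <- residuals(arima.crime)

mean(res)  # should be close to 0

# Ljung-Box test: a high p-value suggests no leftover autocorrelation
Box.test(res, lag = 10, type = "Ljung-Box")
```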

  10. Use the ARIMA model to do a sample forecast for the next week of data:
# h = the number of time units ahead to forecast
forecast.7days <- forecast(arima.crime, h=7)
round(sum(forecast.7days$upper[,2]),0)
## [1] 91
forecast.7days$mean
## Time Series:
## Start = 20060 
## End = 20066 
## Frequency = 1 
## [1] 6.651305 6.161095 6.116167 6.112050 6.111672 6.111638 6.111634
  11. Identify the number of days between the last date in your dataset and the end of 2025:
forecast.window <- as.numeric(as.Date("2025-12-31")-max(crime$DATE))
  12. Forecast the number of crimes per day for every day in that window, and plot it:
# request 80% and 90% intervals to match the CI80/CI90 labels used below
forecast.2025 <- forecast(arima.crime, h=forecast.window, level = c(80, 90))
autoplot(forecast.2025)

  13. Extract the forecasted values as a table, clean up the column names, and add the forecast date and month:
forecast.values <- as.data.frame(forecast.2025$mean)
forecast.values$ID <- seq.int(nrow(forecast.values))
forecast.upper <- as.data.frame(forecast.2025$upper)
forecast.upper$ID <- seq.int(nrow(forecast.upper))
forecast.values <- forecast.values %>%
  left_join(forecast.upper, by = 'ID')
colnames(forecast.values) <- c("MEAN", "ID", "CI80", "CI90")
forecast.values$DATE <- as.Date(max(crime$DATE) + forecast.values$ID)
forecast.values$MONTH <- months(forecast.values$DATE)
  14. Filter to 2025, summarize forecasts by month:
forecast.values.2025 <- subset(forecast.values, forecast.values$DATE > '2024-12-31')
forecast.months <- forecast.values.2025 %>%
  group_by(MONTH) %>%
  summarise(MEAN = round(sum(MEAN),0), FORECAST.90 = round(sum(CI90),0), FORECAST.80 = round(sum(CI80),0))
forecast.months$DIFF <- forecast.months$FORECAST.90 - forecast.months$FORECAST.80

Visualize

  15. Graph of crime with a trend line:
graph.crime <- ggplot(crime, aes(x=DATE, y=CRIME.COUNT)) + 
  geom_line() + 
  scale_x_date(date_breaks = "1 year", date_labels = "%Y") + 
  xlab("Years") + 
  ylab("Crime Count") + 
  ggtitle("TITLE HERE") + 
  geom_area(fill="lightblue", color="black")

graph.crime + 
  geom_smooth(method = lm, col = "red", se = FALSE) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

  16. Graph actual crime with predicted crime:
crime2024 <- crime[c(1,2)]
crime2025 <- forecast.values[c(5,1)]
names(crime2025) <- c("DATE", "CRIME.COUNT")

new.crime <- rbind(crime2024, crime2025)

graph.new.crime <- ggplot(new.crime, aes(x=DATE, y=CRIME.COUNT)) + 
  geom_line() + 
  scale_x_date(date_breaks = "1 year", date_labels = "%Y") + 
  xlab("Years") + 
  ylab("Crime Count") + 
  ggtitle("TITLE HERE") + 
  geom_area(fill="lightblue", color="black")

graph.new.crime + 
  geom_smooth(method = lm, col = "red", se = FALSE) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
## `geom_smooth()` using formula = 'y ~ x'

  17. Tweak the data table for making a graph:
forecast.months$MONTH <- factor(forecast.months$MONTH, levels = forecast.months$MONTH)
forecast.long <- pivot_longer(forecast.months, cols = c(FORECAST.80, DIFF), 
                          names_to = "Category", values_to = "Value")
forecast.long$Category <- gsub("DIFF", "90% Confidence Interval", forecast.long$Category)
forecast.long$Category <- gsub("FORECAST.80", "80% Confidence Interval", forecast.long$Category)
  18. Graph your monthly forecasts, first for the upper bounds, and then for the mean:
ggplot(forecast.long, aes(x = MONTH, y = Value, fill = fct_rev(Category))) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = Value), size = 3, colour = 'white', position = position_stack(vjust = 0.5)) + 
  labs(title = "Washington D.C. 2025 Monthly Forecast Upper Bounds", 
       x = "Month", 
       y = "Crime Count") +
  scale_fill_manual(values = c("90% Confidence Interval" = "blue",
                               "80% Confidence Interval" = "grey"),
                    name = "Forecasts") +
  scale_x_discrete(limits = c("January", "February", "March", "April", "May", "June", "July", "August",
                              "September", "October", "November", "December")) + 
  coord_cartesian(ylim = c(0, 550)) +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

ggplot(forecast.months, aes(x = MONTH, y = MEAN)) +
  geom_bar(stat = "identity") +
  scale_x_discrete(limits = c("January", "February", "March", "April", "May", "June", "July", "August",
                              "September", "October", "November", "December")) +
  coord_cartesian(ylim = c(150, 200)) +
  theme_classic() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "Washington D.C. 2025 Mean Monthly Forecast", 
       x = "Month", 
       y = "Crime Count")

  19. Map your monthly hotspots:
dc.roads <- roads("DC", "District of Columbia")
dc.roads.major <- dc.roads %>%  filter(RTTYP %in% c("I","S","U"))
dc.outline <- county_subdivisions("DC", "District of Columbia")
ggplot() +
  geom_sf(data = dc.outline, color = "grey") +
  geom_hex(aes(x = LONGITUDE, y = LATITUDE), data = dc.subset, bins = 8) +
  scale_fill_continuous(type = "viridis") +
  geom_sf(data = dc.roads, color = "black", alpha = 0.4) +
  theme_classic() +
  facet_wrap(~ MONTH)

4. Findings

INSTRUCTIONS

Writing Tips

Related, some general notes about writing. Welcome to grad school, where the quality of your writing is held to a higher standard. Write directly. Thus, avoid using these words and phrases:

Also, words have meanings. Some are stronger than others. Others aren’t real.

Further, here are some other strategies to consider:

And finally, this is just fun.

Creating the Output

  1. When submitting your projects, you are submitting either an HTML or PDF output of your Markdown file, as an upload to Canvas - not via email.
  2. When you are ready to see what the output looks like, click the ‘Knit’ button. The output file will be saved to the same folder this script is saved to.
  3. Never ever ever include any install.packages functions. Either comment them out with a #, or delete them.

Grading: 10pts

1. Research Question

1pt

Provide a brief paragraph (1-2 sentences) on the research question. State the question, in terms of behavior, space, and time. Define key terms and concepts, to include how variables will be measured.

2. Data

1pt

Provide a brief paragraph on the data used in this study. This includes the specific source(s), and the specific temporal and spatial constraints. Address any major advantages and disadvantages with using these data compared to other sources. Be specific enough that a reader could replicate this work.

3. Methods

1pt

Provide a brief paragraph on the research methods leveraged to answer this question. This includes the software, calculations, skills, techniques, and unique workflows used to analyze your data and develop an answer. You do NOT need to describe click-by-click instructions or lines of code describing how you did things; you DO need to describe a logical process that is specific enough that a reader could replicate.

4. Findings

3pts

Provide two paragraphs, 4-5 sentences each (1.5pts each). Paragraph 1 should describe your forecast: what do you expect to occur during 2025 - the highs, the lows, the norms, etc. Describe the count of events and likely locations (specific areas of the city).

Paragraph 2 should be an integration. Using your understanding of criminology, make sense of this forecast. Provide some context to the numbers and locations. Not necessarily why these crimes will occur, but what are the factors influencing your model. Provide at least one reference/citation to a peer-reviewed/academic research study that supports your response.

5. Visuals

3pts

6. Formatting

1pt

ONE LAST THING… the source code for creating this file can be found here.

Please email me with any questions.